TRR: Target-relation regulated network for sequential recommendation

Item co-occurrence is an important pattern in recommendation. Because item correlations differ, the matching degrees between the target item and historical items vary: the higher the matching degree, the greater the probability that they co-occur. Recently, recommendation performance has been greatly improved by leveraging item relations. Because relations bind items together, connected items should be strongly correlated in the calculation of certain measures. This kind of correlation can serve as biased knowledge that benefits parameter training. Specifically, we focus on tuples containing the target item and the latest relational items in the user's behavior sequence, i.e., items that have relations such as complement or substitute with the target item. Such close relations mean that the matching degrees between relational items and historical items should be strongly affected by those between the target item and historical items. For example, given a relational item that is a complement of the target item, if the target item has high matching degrees with some items in the user's behavior sequence, this complementary item should behave similarly, because complementary items co-occur. Guided by this idea, we propose a target-relation regulated mechanism that converts the biased knowledge of highly correlated matching degrees into a regulation. It integrates the target item and the relational items in the history as a whole to characterize the matching score between the target item and historical items. Experiments conducted on three real-world datasets demonstrate that our model can significantly outperform a set of state-of-the-art models.


Introduction
Due to the overwhelming amount of data that people face on the Internet, recommendation is becoming increasingly important. It helps alleviate information overload in fields like e-commerce by retrieving information and discovering content. Most existing recommendation models work on implicit feedback, such as purchase records, to learn personalized preferences. Although effective, such models are hard to improve further because of data sparsity. Recently, some works have taken item relations into account; an example illustrating such relations is shown in Fig 1. Item relations like complement and substitute can help narrow the scope of recommendation, which greatly alleviate […] relational item as a whole to characterize the matching score. This regulated matching score is then used to determine whether the target item is the one the user will be interested in at the next step. These steps convert the aforementioned biased knowledge into a constraint on the matching score to better employ such correlation. In summary, the contributions of this work are as follows:
• We highlight that the strong correlation between the target item and relational items can be viewed as biased knowledge and converted into a regulator embedded in parameter learning.
• We propose a target-relation regulated network which introduces a constraint by integrating the target and relational item as a whole to characterize the matching scores between the target and historical items.
• We evaluate our model on three real-world datasets and achieve significant improvements over the state-of-the-art baselines for recommendation.

Related work
In this section, we briefly review the related works in two aspects, namely sequential recommendation and item relation modeling.

Sequential recommendation
Recommendation systems have made considerable progress in recent years. The trajectory can be traced from early Collaborative Filtering (CF) to the current Sequential Recommendation (SR) [3], which relies on the user's behavior sequence to predict the next item that the user might be interested in. In the early days, explicit feedback such as ratings was modeled most often, and models like MF [4] and SVD++ [5] have shown very powerful representation ability. The research focus later shifted from explicit feedback to implicit feedback because the latter is more common and widely available. The training objective also shifted from rating prediction to ranking, and finally evolved into the current top-k recommendation. Models like MF linearly aggregate the products of latent embeddings, which is insufficient to capture complex user-item interactions. NCF [6] was thus proposed to learn a non-linear function via a multi-layer neural network. Sequential recommendation aims to capture sequential patterns. Recurrent Neural Networks (RNN) and variants such as Gated Recurrent Units (GRU) have been incorporated into sequential modeling [7]. However, RNNs have some shortcomings, such as difficulty in capturing long-term dependencies, poor parallelism, and overly strict order assumptions on the interaction sequence. Subsequently, Convolutional Neural Networks (CNN) have also been explored and obtained good results [8]. One problem of CNN-based models is that they have difficulty capturing relations between items that are not close to each other in the sequence.
Recently, some works have employed advanced techniques, e.g., attention mechanisms [9][10][11][12] and gating mechanisms [13], for sequential recommendation to distinguish the importance of different items in a sequence. SASRec [14], based on the self-attention mechanism, demonstrated promising results in modeling the mutual influence between historical interactions. HGN [13] exploits item co-occurrence as one of the model's building blocks.

Item relation modeling
Traditional recommendation techniques can perform well when sufficient interaction information is provided. However, data sparsity is often encountered in practice. In the real world, relations with concrete semantics exist among items. To introduce more effective information and address data sparsity, models incorporating item relations have recently received research attention [2][15][16][17]. They mainly use a Knowledge Graph (KG) to learn relation semantics between items and embed them into item embeddings. CFKG [1], which defines a variety of entities and relations, learns representations over a structured heterogeneous knowledge graph for recommendation. Chorus [2] considers relation types between items and their corresponding temporal dynamics to better capture the evolving effects of relations. Its drawback is that it needs handcrafted temporal decay functions. KDA [18] improves on this by introducing the Fourier transform to model the varying temporal effects of different relational interactions.

Problem formulation
We first formulate the task of sequential recommendation with item relations. Let $U$ and $I$ denote the sets of users and items, respectively. The interaction history of all users, $\mathcal{A} = \{S^1, S^2, \ldots, S^{|U|}\}$, is given. Each user $u$ has a sequence of item interactions in chronological order, $S^u = \{s^u_1, s^u_2, \ldots, s^u_t\}$, where $s^u_i \in I$, $0 \le i \le t$. The task is to choose the $k$ items from $I$ that are most likely to interest the user at time step $t + 1$, based on the historical interactions. In addition, for the knowledge graph embedding task in the first part of our model, we denote by $R$ the set of item relations, of size $h$, where a relation $r \in R$ can be complement, substitute, etc.
Fig 2 illustrates the overall architecture of TRR, which consists of two parts. The first part is for item relation modeling, where we learn item representations from the knowledge graph of item relations and obtain a good initialization for the subsequent recommendation part. The second part is for the recommendation task, where the relational items are exploited to measure matching scores between the target and historical items.

Relational knowledge graph embedding
Let us first look at the task of modeling item relations from the knowledge graph of item relations. There are often relations between items that we can employ. Take two commonly seen behaviors on shopping sites, "also_view" and "also_buy", as an example. We denote "also_view" as the relation substitute and "also_buy" as the relation complement, both of which are useful for accurate recommendation. For instance, if you have purchased an iPhone, you are very likely to be interested in AirPods, because a strong complementary relation exists between them. However, if you have purchased Powerbeats, you most likely do not want to buy AirPods within a short period, because they overlap functionally and are substitutable. Items and the relations between them form a knowledge graph consisting of many triples $(s, r, o)$, where $s$ and $o$ are items and $r \in R$ is a relation. Given the knowledge graph, the next step is to train model parameters with it. Considering that the number of relations in this knowledge graph is small, we select TransE [19] from the traditional knowledge graph embedding models as the training method. The score function for each triple used in TransE is:
$$f(s, r, o) = \lVert \mathbf{s} + \mathbf{r} - \mathbf{o} \rVert_2^2 \qquad (1)$$
where $\mathbf{s}$, $\mathbf{r}$ and $\mathbf{o}$ are the embeddings of $s$, $r$ and $o$, respectively. In the loss function for this task, $\gamma$ is a fixed margin, $\sigma$ is the sigmoid function, and $(s'_i, r, o'_i)$ is the $i$-th sampled negative triplet.
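The score function in Eq 1 can be illustrated with a minimal sketch. The embedding values below are made up for illustration and are not learned parameters; the function name `transe_score` is also hypothetical. A lower score means that the triple is more plausible under TransE.

```python
from typing import Sequence


def transe_score(s: Sequence[float], r: Sequence[float], o: Sequence[float]) -> float:
    """Squared L2 distance ||s + r - o||^2 (Eq 1); lower means more plausible."""
    return sum((si + ri - oi) ** 2 for si, ri, oi in zip(s, r, o))


# Toy 3-dimensional embeddings (illustrative values only).
iphone = [1.0, 0.0, 0.5]
complement = [0.5, 1.0, -0.5]
airpods = [1.5, 1.0, 0.0]   # chosen so that iphone + complement == airpods
milk = [-1.0, 2.0, 1.0]

plausible = transe_score(iphone, complement, airpods)
implausible = transe_score(iphone, complement, milk)
print(plausible, implausible)
```

In this toy setting the triple (iPhone, complement, AirPods) receives a smaller score than (iPhone, complement, Milk), which is the behavior that training is meant to encourage.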

Training of knowledge graph
We have constructed a knowledge graph and corresponding loss function focusing on item relations in the previous step. Then we train the knowledge graph to learn item representations affected by item relations. After enough epochs of training, we turn to the second part of our model: the recommendation task. At that time, item embeddings have integrated the structural information from knowledge graph which cannot be learned in the subsequent recommendation task.

Embedding layer
As shown in Fig 2, the input of TRR is a user's behavior sequence, which contains a series of items in time order. We first convert the input $(s^u_1, s^u_2, \ldots, s^u_t)$ into a fixed-length sequence $(s^u_1, s^u_2, \ldots, s^u_L)$ to facilitate subsequent operations, where $L$ denotes the maximum sequence length that TRR will process. If the sequence length is greater than $L$, we truncate it and keep the most recent $L$ items; otherwise, we pad the sequence to the fixed length $L$.
We maintain an item embedding matrix $M_I \in \mathbb{R}^{|I| \times d}$, where $d$ is the embedding dimension. The item embedding matrix projects the high-dimensional one-hot representation of an item to a low-dimensional dense representation. Given an interaction sequence $s^u = (s^u_1, s^u_2, \ldots, s^u_L)$, we form the input embedding matrix by stacking the embeddings $\mathbf{k}_1, \ldots, \mathbf{k}_L$, where $\mathbf{k}_r \in \mathbb{R}^d$ is the embedding of the $r$-th item.
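The truncation, padding and embedding lookup described above can be sketched as follows. The text does not say on which side the padding is applied or which id is used for it, so left padding with a reserved id 0 is an assumption here, as are the helper names and the toy embedding table.

```python
from typing import List

PAD_ID = 0  # assumption: item id 0 is reserved for padding


def to_fixed_length(seq: List[int], max_len: int) -> List[int]:
    """Keep the most recent max_len items; left-pad shorter sequences with PAD_ID."""
    recent = seq[-max_len:]
    return [PAD_ID] * (max_len - len(recent)) + recent


def embed(seq: List[int], table: List[List[float]]) -> List[List[float]]:
    """Look up one d-dimensional row per item id, giving an L x d matrix."""
    return [table[item_id] for item_id in seq]


# Toy embedding table with d = 2; row 0 is the all-zero padding embedding.
table = [[0.0, 0.0]] + [[float(i), float(i) / 10] for i in range(1, 10)]

long_seq = to_fixed_length([1, 2, 3, 4, 5, 6, 7], max_len=5)
short_seq = to_fixed_length([8, 9], max_len=5)
matrix = embed(short_seq, table)
print(long_seq, short_seq, matrix)
```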

Target-relation regulated representation learning
Sequential dependency is ubiquitous in user-item interactions. To better determine users' current interests, we should look into their historical interactions. However, in many cases only a few past items play an important role. How do we distinguish which items are important? Given a user's interaction history, some historical items have relations with the target item and others do not. Historical items that have important relations with the target item deserve our attention. Because of the close relationship between the target and relational items, the matching degrees between the target item and historical items are strongly correlated with the matching degrees between the relational items and historical items. This can be viewed as biased knowledge and converted into a regulator in the calculation of matching scores. For example, consider two users' purchasing histories $U_1$ = {Shoes, MacBook, Milk} and $U_2$ = {Bose Headphone, Samsung Phone, HuaWei Watch}. Suppose we now predict whether they will be interested in buying an iPhone. By browsing their purchasing histories, we find that the target iPhone is complementary to MacBook in $U_1$ and substitutable for Samsung Phone in $U_2$. $U_1$ is therefore likely to be interested in an iPhone, while $U_2$ most likely is not, because this user already owns a functionally similar item. This scenario explains the importance of introducing relational items. We now further elaborate on capturing item co-occurrence by exploiting relational items. Consider another user's interaction history $U_3$ = {MacBook, AirPods, Milk}. Compared with $U_1$, $U_3$ is more willing to buy an iPhone, because the union of the two complementary items MacBook and AirPods makes buying the target iPhone more likely than the single complementary MacBook in $U_1$.
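The lookup of relational items in the example above can be sketched as below. The `RELATED` table and the function `latest_relational_items` are hypothetical stand-ins for the relation metadata; only the item names and the three histories come from the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical relation table: (relation, target item) -> related items.
RELATED: Dict[Tuple[str, str], Set[str]] = {
    ("complement", "iPhone"): {"MacBook", "AirPods"},
    ("substitute", "iPhone"): {"Samsung Phone"},
}


def latest_relational_items(history: List[str], target: str,
                            relation: str, n: int = 1) -> List[str]:
    """Return up to n most recent history items related to target, newest first."""
    related = RELATED.get((relation, target), set())
    hits = [item for item in reversed(history) if item in related]
    return hits[:n]


u1 = ["Shoes", "MacBook", "Milk"]
u2 = ["Bose Headphone", "Samsung Phone", "HuaWei Watch"]
u3 = ["MacBook", "AirPods", "Milk"]

print(latest_relational_items(u1, "iPhone", "complement"))
print(latest_relational_items(u2, "iPhone", "substitute"))
print(latest_relational_items(u3, "iPhone", "complement", n=2))
```

Histories are assumed to be stored in chronological order, so iterating in reverse yields the most recent relational items first.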
Let $\mathbf{i}$ denote the representation of the target item. In our mechanism, the matching degrees of the target item and of the items related to it are strongly correlated. To integrate this constraint, for each type of relation we first add to the target item at most $n$ of the latest relational items from the history to form the query vector (Eq 4): the query is the sum of the target item embedding and the weighted relational item embeddings, where $r \in R$ is one of the item relations, $\mathbf{j}^r_i$ denotes the embedding of a historical item having relation $r$ with the target item, $\lambda^r_i$ controls the strength of each relational item, and $\mathbf{q}^r_i$ represents the query vector with respect to relation $r$. We list three ways to aggregate the influence of multiple relational items (Eqs 5 to 7). The first is to calculate the average, which sets the coefficient $\lambda^r_i$ to $1/n$. In the remaining two, $\lambda^r_i$ is learned through a multilayer perceptron. Given the relational items $J = (\mathbf{j}^r_1, \ldots, \mathbf{j}^r_n) \in \mathbb{R}^{n \times d}$, the aggregated representation is obtained with trainable parameters $W \in \mathbb{R}^{n \times n}$. In these formulas, the superscript $\top$ denotes the matrix transpose, $\mathrm{SUM}(A, i)$ sums the embeddings along the $i$-th dimension of $A$, and $\sigma$ is the sigmoid function. TRR adopts one special case, where $n = 1$ and $\lambda_1 = 1$. This means that only the latest relational item is used for each type of relation. The impact of different aggregation methods for $\lambda_i$ is explored in the ablation experiments. Then we use the query vector $\mathbf{q}^r_i$ to calculate the matching degrees with the items in the user's behavior sequence. We use the dot product to compute the matching degree:
$$v^r_m = \mathbf{q}^r_i \cdot \mathbf{k}_m$$
where $m \in [1, L]$ represents the item position in the user's history sequence, the vector $\mathbf{k}_m$ is the item embedding at that position, and $v^r_m$ is the matching degree between $\mathbf{q}^r_i$ and the historical item $\mathbf{k}_m$. The larger $v^r_m$ is, the higher the probability that this item co-occurs with the target and the latest relational item.
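For the special case adopted by TRR ($n = 1$, $\lambda_1 = 1$), the query construction and the dot-product matching can be sketched as follows. This is one reading of the description above rather than the authors' implementation; the function names and the 2-dimensional embeddings are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def matching_degrees(target: Vector, relational: Vector,
                     history: List[Vector]) -> List[float]:
    """Special case n = 1, lambda = 1: query = target + latest relational item."""
    query = add(target, relational)
    # One matching degree per history position m = 1..L.
    return [dot(query, k_m) for k_m in history]


target = [1.0, 0.0]        # embedding of the target item (toy values)
relational = [0.0, 1.0]    # latest historical item related to the target
history = [[1.0, 1.0], [0.5, -0.5], [-1.0, 0.0]]

v = matching_degrees(target, relational, history)
print(v)  # [2.0, 0.0, -1.0]
```

The first history item aligns with both the target and the relational item, so it receives the largest matching degree.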
To reduce the noise introduced by irrelevant historical items, we further use a gating layer to adjust their relative matching degrees and then aggregate them by addition (Eq 9), where $W_1 \in \mathbb{R}^{L \times L}$, $\mathbf{v}^r \in \mathbb{R}^L$ has $v^r_m$ as one of its elements, $\sigma$ is the sigmoid function, and $z^r$ is the final adjusted matching degree between the target and historical items under a certain relation. Note that sigmoid is used here instead of softmax because multiple items may be important; softmax would reduce the distinction between important and unimportant items. Finally, we add up the influence of the different relations to synthesize them.
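The argument for sigmoid over softmax can be illustrated with a small comparison. The sketch below applies both activations directly to made-up raw scores; it does not reproduce the gating layer itself, which also involves the learned matrix $W_1$.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function, applied independently to each score."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs compete and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Two important history items and one irrelevant item (illustrative scores).
scores = [4.0, 4.0, -2.0]

sig = [sigmoid(x) for x in scores]
soft = softmax(scores)
print([round(g, 3) for g in sig])
print([round(g, 3) for g in soft])
```

With sigmoid, each important item keeps a gate close to 1 regardless of how many other important items exist. With softmax, the important items must share a total mass of 1, so each weight shrinks as more important items appear in the sequence.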

Prediction layer
We have introduced the core mechanism of TRR. Next we describe the prediction layer which combines matrix factorization and the above schema to produce the prediction score.
Since each user and each item has a latent representation, the score $\hat{y}$ used for ranking combines two terms: $\mathbf{u}^\top \mathbf{i}$ captures the matching degree from the user-item perspective, and the second term, $s$, captures the influence from the item-item perspective.
To train the parameters in the recommendation part of TRR, we need to select an appropriate loss function to optimize. Since the user interactions are sequences of implicit feedback, a pairwise ranking loss is used to optimize the proposed model in training. It aims to rank the observed next item (positive) ahead of the accompanying negative sample, where $j \notin S^u$ is a negative item sampled from the training set.
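A pairwise ranking objective of this kind can be sketched as below. The exact formulas are not reproduced here, so two assumptions are made and should be checked against the original equations: the two score terms are combined by addition, and the loss takes the standard BPR form $-\log \sigma(\hat{y}_{pos} - \hat{y}_{neg})$. All names are illustrative.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def score(user: List[float], item: List[float], item_item_term: float) -> float:
    """Assumed additive form: user-item dot product plus the item-item term."""
    return dot(user, item) + item_item_term


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def pairwise_ranking_loss(pos_score: float, neg_score: float) -> float:
    """BPR-style loss: -log(sigmoid(pos - neg)) == softplus(neg - pos)."""
    return softplus(neg_score - pos_score)


user = [0.5, 1.0]
pos = score(user, [1.0, 1.0], item_item_term=0.5)    # observed next item
neg = score(user, [-1.0, 0.0], item_item_term=0.0)   # sampled negative item
print(pairwise_ranking_loss(pos, neg), pairwise_ranking_loss(neg, pos))
```

The loss is small when the positive item is scored well above the negative one and grows when the order is reversed.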

On comparison with Chorus
Note that our proposed TRR is different from Chorus [2]. Chorus focuses on the temporal dynamics of item relations, where the representation of each item includes the basic item embedding and the temporal evolution from different relations. In contrast, TRR does not include the time factor, because stacking it on top of our mechanism brings limited benefit. Another main difference is that, although Chorus considers the temporal influence of relations between items, it only uses the relation type information. This is somewhat vague, since the same relation can cover a large number of different items, which leads to low discrimination. TRR instead considers the relational items of the target item. Greater discrimination can be obtained because different items have their own representations and contribute different influences, strong or weak, even if they all belong to the same relation. Besides, unlike relation types, relational items are homogeneous with the target item. This consistency helps in measuring matching degrees. The third main difference is that Chorus mainly relies on taking both relation types and the corresponding temporal dynamics into consideration to improve performance, whereas TRR treats the strong correlation between the target and relational items as biased knowledge and converts it into a constraint in the calculation of matching scores to help parameter learning.

Evaluation
In this section, we conduct experiments to verify the effectiveness of the proposed TRR. We first describe the datasets, evaluation metrics, baseline methods and experimental settings in detail. Then we report the experimental results and conduct an in-depth analysis.

Datasets
To evaluate the performance of the proposed TRR, we conduct experiments on three public Amazon datasets. These datasets contain not only timestamped user interaction records but also item metadata. Typical metadata includes relations between items, such as "also_view" and "also_buy", and category information. We selected three representative datasets from the Amazon dataset library: Grocery and Gourmet Food (Grocery), Cellphones and Accessories (Cellphones), and Home and Kitchen (Home). The raw data and preprocessing code can be found on GitHub (https://github.com/THUwangcy/ReChorus/tree/SIGIR20/data). Statistics of the three datasets are summarized in Table 1.

Evaluation protocols
To evaluate the top-K recommendation performance, we employ Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) to measure the recommendation quality, as done in many previous methods. HR@K indicates whether the test item appears in the top-K recommended list, and NDCG@K takes the ranking position of correctly recommended items into account.
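Both metrics can be sketched for the common setting in which each user has a single held-out test item that is ranked against a set of candidates. That setting, the tie handling and the function names are assumptions made for illustration. With one relevant item the ideal DCG is 1, so NDCG@K reduces to $1 / \log_2(\text{rank} + 1)$ when the item is ranked within the top K.

```python
import math
from typing import List


def rank_of_target(target_score: float, candidate_scores: List[float]) -> int:
    """1-based rank of the test item among the candidates (ties favor the target)."""
    return 1 + sum(1 for s in candidate_scores if s > target_score)


def hr_at_k(rank: int, k: int) -> float:
    """1 if the test item is in the top-k list, else 0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank: int, k: int) -> float:
    """Single relevant item: the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


rank = rank_of_target(0.7, [0.9, 0.8, 0.1])
print(rank, hr_at_k(rank, 5), ndcg_at_k(rank, 5))
```

Per-user values are then averaged over all test users to obtain the reported HR@K and NDCG@K.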

Comparison methods
To show the effectiveness of the proposed TRR, two groups of baselines are considered. The first group consists of general sequential recommendation methods like SASRec [14] and HGN [13]. The other consists of models employing item relations, like Chorus [2] and KDA [18]. The compared state-of-the-art models are listed as follows:
• BPR. A classic CF method applying Bayesian Personalized Ranking to Matrix Factorization [4] for recommendation.
• GMF. A classic CF model using multiple non-linear layers of neural network [6].
• Tensor. A model that splits time into bins and factorizes a three-dimensional tensor for recommendation [20].
• Caser. A CNN-based model which learns high-order dependency via vertical and horizontal convolutions for sequential recommendation [8].
• SASRec. A self-attention based model which can learn long-term dependency and identify relevant items for prediction [14].
• HGN. An attentive model which proposes a hierarchical gating architecture and the Item-item Product mechanism [13].
• CFKG. A collaborative filtering model learning over a structured knowledge graph [1].

PLOS ONE
• SLRC. A sequential model considering both item relations and their temporal dynamics [21].
• Chorus. A sequential model considering not only item relations but also their evolutional effects along time [2].
• Locker. A sequential model which improves the self-attentive mechanism by enhancing short-term user dynamics modeling [12].
• KDA. A sequential model considering relational effects and their temporal evolutions using Fourier transform [18].

Implementation details
For comparison purposes, we follow some configurations used in Chorus [2]. The embedding size is set to 64. We use the Xavier initializer with a mean of 0 and a standard deviation of 0.01 to initialize the learnable parameters. The Adam [22] optimizer is used as the optimization algorithm. We adopt early stopping with a patience of 5 epochs to prevent overfitting, and NDCG@5 is set as the indicator. In particular, we directly use the results of BPR, GMF, GRU4Rec, NARM, CFKG and SLRC reported in Chorus [2].

Experimental results
Table 2 shows the performance comparison of the proposed TRR and the baseline methods on the three datasets. We make several observations from this table:
• Sequential models like GRU4Rec and NARM perform better than the traditional collaborative filtering methods (BPR and GMF). This is due to the consideration of sequential dependency. Models like Caser, HGN, SASRec and Locker achieve strong results thanks to their high-quality context-aware representations. However, they contain more parameters and therefore require a large amount of training data to be well trained. This can be a problem given the sparse user behavior in many datasets.
• The relation-based models SLRC, Chorus and KDA are among the best baselines. This shows the benefit of exploiting item relations. For KDA, the impressive performance shows the effectiveness of modeling item relations and their temporal dynamics using the Fourier transform with learnable frequency domain embeddings.
• Finally, the proposed TRR consistently outperforms the baselines. This confirms the significance of characterizing patterns of item co-occurrence guided by relational items, which transforms the strong correlation imposed by item relations into regulated matching scores between the target and historical items. We conduct t-tests, and p-values below 0.05 indicate that the performance improvements of TRR are statistically significant.

Ablation analysis
In this section, we perform a series of ablation experiments on the proposed TRR to confirm its effectiveness and better understand the impact of each key module.

Impact of the target-relation regulated representation mechanism. To verify the effectiveness of integrating the target and relational items as a whole to characterize the matching scores between the target and historical items, we remove the relational items from Eq 4, i.e., only the term $\mathbf{i}$ is retained, and denote this variant as TRR_nr. The results are reported in Fig 3. The performance drop of TRR_nr demonstrates the advantage of mining the strong correlation between the target and relational items. TRR exploits the co-occurrence of three items, thus leading to more accurate scoring.
Impact of gating layer. TRR leverages a gating layer in the target-relation regulated mechanism to adjust the weight of the matching degrees between the target and historical items, which is done in Eq 9. We study the impact of this gating layer by comparing two variants. The first variant removes the gating layer from TRR and is denoted as TRR-ng. The second variant replaces the sigmoid function in Eq 9 with softmax and is denoted as TRR-att. The results are reported in Fig 4. The performance degradation of TRR-ng indicates that it is helpful to further adjust the relative scales within the group of historical items through the gating mechanism. This can improve the distinguishability of important items. Besides, the overall performance of TRR-att is slightly inferior to that of TRR. The gap is more obvious on the Cellphones dataset. This is because in this dataset there are more items with important relations in the user's behavior sequence, which can be seen from the column "relational ratio in test set" in Table 1. Sigmoid is more compatible with this scenario.
Impact of training method for knowledge graph. We now evaluate the influence of the training method used for the knowledge graph. Three additional knowledge graph embedding models, TransH [23], ConvR [24], and RotatE [25], are compared with the TransE [19] adopted in TRR. TransH is a translational model where entities have different representations under different relations. ConvR utilizes a convolutional network designed to maximize entity-relation interactions. RotatE defines each relation as a rotation from the source entity to the target entity in the complex vector space. We denote the three models using them as "TRR w TransH", "TRR w ConvR" and "TRR w RotatE" respectively in Table 3. From the results, we note that newer models like RotatE and ConvR are inferior to older ones like TransH and TransE (adopted in TRR). This can be attributed to the simple relationships in the constructed knowledge graph, where the number of edge types is small. In such a case, complex models are prone to overfitting, and thus we prefer simple models with fewer parameters.
Impact of relational item number n. To study the effect of the number of relational items incorporated in Eq 4, we vary $n$ in the range of 1 to 5 and use the three aggregation methods described there. The results using Eq 5 are denoted as Mean, the results using Eq 6 as Softmax and the results using Eq 7 as Sigmoid. The performance is reported in Fig 5, where TRR-2 sets $n$ to 2, TRR-3 sets $n$ to 3 and so on. The proposed TRR adopts $n = 1$ and $\lambda_1 = 1$, and its results are marked with a red dotted line. We observe that different datasets have their own suitable aggregation methods, which are determined by the distribution characteristics of the datasets. In general, using more relational items causes performance degradation in many cases. This is reasonable since the correlation between relational items often weakens over time. The older the relational item is, the less meaningful it is for the target item, and the more likely it is to introduce noise.

Influence of hyper-parameter
Proper hyper-parameter settings are important for model performance. In this section, we investigate the impact of the history length $L$, which is used to calculate the matching scores between the target and historical items in the target-relation regulated representation learning mechanism. The results, measured with NDCG@5, are shown in Fig 6. We observe that performance degrades when $L$ is either too small or too large. A small $L$ provides insufficient context information, while a large $L$ causes interference from many irrelevant items. The best setting varies across datasets, depending on the data size and sparsity. In addition, the results do not fluctuate much near the appropriate length, which shows the stability and robustness of our model.

Conclusion
In this work, we focus on employing the correlation between the target and relational items and propose a target-relation regulated mechanism, TRR, for sequential recommendation with item relations. The strong relationships between them extend to the calculation of matching degrees and are informative about the pattern of item co-occurrence. TRR utilizes such knowledge and integrates relational items into the calculation of matching scores between the target and historical items. Extensive experimental results and analysis on three real-world datasets demonstrate that our proposed model TRR consistently outperforms the state-of-the-art methods. However, due to data sparsity, relations may exist between some items without being observed, and the performance improvement of TRR depends on these relational items. In the future, we plan to extend TRR to incorporate other auxiliary information to enrich the relational items of the target item. Auxiliary information such as category, user reviews and knowledge bases can be employed to obtain more appropriate relational items from different perspectives.